In this paper, we study a class of non-parametric density estimators
under Bayesian settings. The estimators are piecewise constant
functions on binary partitions. We analyze the concentration rate of
the posterior distribution under a suitable prior, and demonstrate
that the rate does not directly depend on the dimension of the problem.
This paper can be viewed as an extension of [12], where the
convergence rate of a related sieve MLE was established. Compared
to the sieve MLE, the main advantage of the Bayesian method is that
it can adapt to the unknown complexity of the true density function,
thus achieving the optimal convergence rate without artificial
conditions on the density.


1. Introduction. In this paper, we study the asymptotic behavior of 
posterior distributions of a class of density estimators based on adaptive 
partitioning. Density estimation is a fundamental problem in statistics— 
once an explicit estimate of the density function is obtained, various kinds of 
statistical inference can follow, including nonparametric testing, clustering, 
and data compression. 

With univariate (or bivariate) data, the most basic non-parametric method 
for density estimation is the histogram method. In this method, the sample 
space is partitioned into regular intervals (or rectangles), and the density is 
estimated by the relative frequency of data points falling into each interval 
(rectangle). However, this method is of limited utility in higher dimensional 
spaces because the number of cells in a regular partition of a p-dimensional 
space will grow exponentially with p, which makes the relative frequency 
highly variable unless the sample size is extremely large. In this situation 
the histogram may be improved by adapting the partition to the data so 
that larger rectangles are used in the part of the sample space where data is 
sparse. Motivated by this consideration, researchers have recently developed 
several multivariate density estimation methods based on adaptive
partitioning. For example, by generalizing the classical Polya Tree construction
([3]), [21] developed the Optional Polya Tree (OPT) prior on the space of
simple functions. In this prior the partition that supports the simple function
is generated by a random recursive partitioning process. As the partition is
random a priori, it can be inferred from its posterior distribution once the
data are observed. Computational issues related to OPT density estimates
were discussed in [14], where efficient algorithms were developed to compute
the OPT estimate. In [14], a different way to construct the random partition
is introduced, in which the size of the partition grows linearly instead of
geometrically as in OPT. This allows the authors to use sequential importance
sampling to sample from the posterior distribution. This Bayesian Sequential
Partition (BSP) method is computationally more scalable to higher
dimensions than the OPT method. As an application, the methods were used to
estimate within-class densities in classification problems, thereby obtaining
approximations to the Bayes classifier. When tested on standard data sets
with p ranging from 10 to 50, the results are competitive with those from leading
classification methods such as SVM and boosted trees.

The purpose of the current paper is to address the following questions
on such Bayesian density estimates based on partition learning. Question 1:
what is the class of density functions that can be well estimated by these
methods? Question 2: what is the rate at which the posterior distribution
concentrates around the true density as the sample size increases? For
question 1, our analysis will make use of some results from a companion
paper [12] on the properties of sieve MLEs where the sieve is constructed
by considering simple functions supported by binary partitions of growing
sizes. Specifically, [12] showed that if the true density can be approximated
in Hellinger distance at a rate of $I^{-r}$, where $I$ is the size of the partition, then
the convergence rate of the sieve-MLE density estimate is $n^{-r/(2r+1)}$ up
to $\log n$ terms, where $n$ is the sample size. We note that the term “well
estimated” in question 1 can now be given a more specific meaning, namely
that the convergence rate of the estimate should not deteriorate fast when
the dimension p of the sample space is large. [12] gave examples of functions
for which the approximation rate is not much affected by p. These include
functions satisfying mixed-Hölder continuity conditions and functions with
spatial sparsity as characterized by fast decay of Haar wavelet coefficients.
It is well known that sieve MLEs are closely related to penalized estimates,
which are in turn related to Bayesian methods ([20], [16] and [17]). Thus we
expect that the class of densities well estimated by the Bayesian methods
should be the same class analyzed by [12], i.e. the class of densities that can
be approximated at rate $I^{-r}$ for some $r > 0$. We will see that this is indeed
true as a consequence of our main result. Our main result (Theorem 2.1)
also provides the answer to the second question: it shows that the posterior
probability is concentrated in a shrinking Hellinger ball around the true
density, where the radius of the ball is $n^{-r/(2r+1)}$ up to $\log n$ terms.

DENSITY ESTIMATION VIA ADAPTIVE PARTITIONING

Although the convergence rate of the Bayesian method matches that of the
sieve MLE, there is an important difference. While this rate is achieved by
the Bayesian method without requiring any knowledge of the constant $r$ that
characterizes the complexity of the true density function, the sieve MLE can
achieve this same rate only if the size of the sieve grows at a rate that depends
on $r$; specifically, the size of the partition must be of order $n^{1/(2r+1)}$
(up to $\log n$ terms). In other words, the Bayesian estimate is adaptive to
the complexity of the true density while the sieve MLE is not. This is an
important difference in practice.

We now briefly review previous literature on convergence rates of posterior
distributions. In breakthrough works [6] and [17], the authors developed
a general theory of posterior convergence rates and discussed several
applications. Following this theory, most results have focused on mixture models
([13] and [4]), because these models allow the study of smooth density
functions. Some elegant works include [7] and [8], which studied the
concentration rate of the posterior distribution under Dirichlet mixtures of Gaussian
priors, and [5] and [15], which examined the posterior concentration rate
under mixtures of Beta priors. Compared to the previous literature, one
major improvement of our result is that it can deal with multivariate cases.
In particular, the rate attained by our estimate is independent of the
dimension p, if the true density falls within the support of the prior. When
specialized to the univariate case, it still coincides with the previous results.
For instance, for the one dimensional Hölder space with parameter between
0 and 1, our result is minimax up to a $\log n$ term. Another contribution is
that our result can adapt to the unknown complexity of the density function.
There have been few adaptive rate results for Bayesian density estimates in
the literature (see [10] for a more extensive review of recent results on
adaptive posterior concentration rates). A notable exception is [15], where the
author obtained adaptive posterior concentration rates for one-dimensional
Hölder spaces under mixture Beta priors. Here, our result can adapt to a
broader range of density functions, including spatially sparse density
functions, Hölder continuous functions, and functions of bounded variation. We
gain this advantage at the cost of relatively poor performance for functions
with higher order smoothness. It is our belief that in the multivariate case,
smoothness is not the best condition to characterize functions that can be
well estimated. The rate under the usual smoothness condition is
$n^{-\alpha/(2\alpha+p)}$ ([19]), where $\alpha$ is the number of derivatives.
Thus high order smoothness cannot guarantee good convergence when p is large.

The article is organized as follows. In Section 2 we define the prior
distribution and summarize our main results on the posterior concentration rate.
We express the posterior measure of the complement of a Hellinger ball as a
ratio, where the numerator is the product of the prior probability and the
likelihood, and the denominator is the normalizing factor. In order to derive the
concentration rate, we need to upper bound the numerator and lower bound
the denominator. In Section 3 and Section 4, we discuss these upper and lower
bounds respectively. Finally, in Section 5, we combine these results to derive
the posterior concentration rate.

2. Main results on posterior concentration rate. In this paper,
we focus on the density estimation problem in the p-dimensional Euclidean
space. Let $(\Omega, \mathcal{B})$ be a measurable space and $f_0$ be a compactly supported
density function with respect to the Lebesgue measure $\mu$. $Y_1, Y_2, \dots, Y_n$ is a
sequence of independent variables distributed according to $f_0$. After
translation and scaling, we can always assume that the support of $f_0$ is contained
in the unit cube in $\mathbb{R}^p$. In notation, we assume that
$\Omega = \{(y^1, \dots, y^p) : y^l \in [0,1]\}$.
$\mathcal{F} = \{f \text{ is a nonnegative measurable function on } \Omega : \int_\Omega f\,d\mu = 1\}$
denotes the collection of all the density functions on $(\Omega, \mathcal{B}, \mu)$.
Then $\mathcal{F}$ constitutes the parameter space in this problem. Note
that $\mathcal{F}$ is an infinite dimensional parameter space.

2.1. Densities on binary partitions. To address the infinite dimensionality
of $\mathcal{F}$, we construct a sequence of finite dimensional approximating spaces
$\Theta_1, \Theta_2, \dots, \Theta_I, \dots$ based on binary partitions. With growing complexity,
these spaces provide more and more accurate approximations to the initial
parameter space $\mathcal{F}$. Here, we use a recursive procedure to define a binary
partition with $I$ subregions of the unit cube in $\mathbb{R}^p$. Let
$\Omega = \{(y^1, \dots, y^p) : y^l \in [0,1]\}$ be the unit cube in $\mathbb{R}^p$. In the first step, we choose one of the
coordinates $y^l$ and cut $\Omega$ into two subregions along the midpoint of the
range of $y^l$. That is, $\Omega = \Omega_0^l \cup \Omega_1^l$, where $\Omega_0^l = \{y \in \Omega : y^l < 1/2\}$ and
$\Omega_1^l = \Omega \setminus \Omega_0^l$. In this way, we get a partition with two subregions. Note that
the total number of possible partitions after the first step is equal to the
dimension p. Suppose after $I - 1$ steps of the recursion, we have obtained
a partition $\{\Omega_i\}_{i=1}^{I}$ with $I$ subregions. In the $I$-th step, further partitioning
of the region is defined as follows:

1. Choose a region from $\Omega_1, \dots, \Omega_I$. Denote it as $\Omega_{i_0}$.
2. Choose one coordinate $y^l$ and divide $\Omega_{i_0}$ into two subregions along the
midpoint of the range of $y^l$.

Such a partition obtained by $I - 1$ recursive steps is called a binary partition
of size $I$. Figure 1 displays all possible two dimensional binary partitions
when $I$ is 1, 2 and 3.

Fig 1. Binary partitions
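The recursive construction above can be illustrated with a short enumeration. The following is only a sketch under our own conventions: a region is stored as a tuple of per-coordinate intervals, a partition as a frozenset of regions, and the function names (`unit_cube`, `split`, `binary_partitions`, `volume`) are hypothetical rather than taken from [12] or [14]. Brute-force enumeration is feasible only for very small sizes.

```python
from fractions import Fraction


def unit_cube(p):
    """The unit cube in R^p, stored as p intervals (lo, hi)."""
    return tuple((Fraction(0), Fraction(1)) for _ in range(p))


def split(region, axis):
    """Cut a region at the midpoint of its range along one coordinate."""
    lo, hi = region[axis]
    mid = (lo + hi) / 2
    left = region[:axis] + ((lo, mid),) + region[axis + 1:]
    right = region[:axis] + ((mid, hi),) + region[axis + 1:]
    return left, right


def binary_partitions(p, size):
    """All distinct binary partitions of the unit cube with `size` subregions."""
    current = {frozenset([unit_cube(p)])}
    for _ in range(size - 1):
        nxt = set()
        for part in current:
            for region in part:          # step 1: choose a region
                for axis in range(p):    # step 2: choose a coordinate
                    left, right = split(region, axis)
                    nxt.add((part - {region}) | {left, right})
        current = nxt                    # the set removes duplicate partitions
    return current


def volume(region):
    """Lebesgue measure of a rectangular region."""
    v = Fraction(1)
    for lo, hi in region:
        v *= hi - lo
    return v


if __name__ == "__main__":
    for size in (1, 2, 3):
        print(size, len(binary_partitions(2, size)))
```

Exact rational endpoints are used so that two different cutting orders leading to the same set of rectangles are recognized as one partition.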

Now, let

$$\Theta_I = \Big\{ f \in \mathcal{F} : f = \sum_{i=1}^{I} \beta_i 1_{\Omega_i},\ \sum_{i=1}^{I} \beta_i \mu(\Omega_i) = 1,\ \{\Omega_i\}_{i=1}^{I} \text{ is a binary partition of } \Omega \text{ of size } I \Big\}.$$

Then, $\Theta_I$ is the collection of the density functions supported by the binary
partitions of size $I$. They constitute a sequence of approximating spaces (i.e.
a sieve, see [9] and [18] for background on sieve theory). Let $\Theta = \cup_{I=1}^{\infty} \Theta_I$
be the space containing all the density functions supported by the binary
partitions. Then $\Theta$ is an approximation of the initial parameter space $\mathcal{F}$ up to
a certain approximation error, which will be characterized later.

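We take the metric on $\mathcal{F}$, $\Theta$ and $\Theta_I$ to be Hellinger distance, which is
defined to be

$$\rho(f, g) = \Big( \int_\Omega \big(\sqrt{f(y)} - \sqrt{g(y)}\big)^2 \, dy \Big)^{1/2}, \quad f, g \in \mathcal{F}. \tag{2.1}$$

For $f, g \in \Theta_I$, let $f = \sum_{i=1}^{I} \beta_i^1 1_{\Omega_i^1}$ and $g = \sum_{j=1}^{I} \beta_j^2 1_{\Omega_j^2}$, where $\{\Omega_i^1\}_{i=1}^{I}$ and
$\{\Omega_j^2\}_{j=1}^{I}$ are binary partitions of $\Omega$. Then the Hellinger distance between $f$
and $g$ can be written as

$$\rho^2(f, g) = \sum_{i=1}^{I} \sum_{j=1}^{I} \Big(\sqrt{\beta_i^1} - \sqrt{\beta_j^2}\Big)^2 \mu(\Omega_i^1 \cap \Omega_j^2). \tag{2.2}$$

We will also use the Kullback-Leibler divergence and the variance of the
log-likelihood ratio based on a single observation $Y$, which are defined to be

$$K(f_0, f) = E_{f_0}\Big(\log \frac{f_0(Y)}{f(Y)}\Big), \tag{2.3}$$

and

$$V(f_0, f) = \mathrm{Var}_{f_0}\Big(\log \frac{f_0(Y)}{f(Y)}\Big). \tag{2.4}$$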
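For piecewise constant densities, the Hellinger distance, the Kullback-Leibler divergence and the variance of the log-likelihood ratio all reduce to finite sums over intersections of cells. The sketch below illustrates this in one dimension only; the representation of a density as a list of `((lo, hi), height)` pairs and the function names are our own assumptions, not notation from the paper.

```python
import math


def overlap(a, b):
    """Length of the intersection of two intervals a = (lo, hi), b = (lo, hi)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def hellinger(f, g):
    """Hellinger distance between two piecewise constant densities on [0, 1].

    Sums (sqrt(height_f) - sqrt(height_g))^2 times the measure of each
    pairwise intersection of cells, then takes the square root.
    """
    total = 0.0
    for cell_f, height_f in f:
        for cell_g, height_g in g:
            diff = math.sqrt(height_f) - math.sqrt(height_g)
            total += diff ** 2 * overlap(cell_f, cell_g)
    return math.sqrt(total)


def kl_and_variance(f0, f):
    """Return (K(f0, f), V(f0, f)) for piecewise constant densities.

    Assumes f is strictly positive wherever f0 is positive.
    """
    first, second = 0.0, 0.0
    for cell_0, height_0 in f0:
        for cell, height in f:
            mass = height_0 * overlap(cell_0, cell)  # f0-probability of the intersection
            if mass > 0.0:
                log_ratio = math.log(height_0 / height)
                first += mass * log_ratio
                second += mass * log_ratio ** 2
    return first, second - first ** 2


if __name__ == "__main__":
    f0 = [((0.0, 0.5), 1.5), ((0.5, 1.0), 0.5)]
    uniform = [((0.0, 1.0), 1.0)]
    print(hellinger(f0, uniform), kl_and_variance(f0, uniform))
```

In this toy example the squared Hellinger distance is smaller than the Kullback-Leibler divergence, in line with the well-known bound of the former by the latter.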



2.2. Approximation error. The accuracy of the approximation to the
true density by the elements in $\Theta$ is formulated in the following way. A
density function $f \in \mathcal{F}$ is said to be well approximated by elements in $\Theta$ if
there exists a sequence of $f_I \in \Theta_I$ satisfying $\rho(f_I, f) = O(I^{-r})$ $(r > 0)$.
This means that there exist constants $A_1$ and $A_2$ such that
$A_1 I^{-r} \le \min_{g \in \Theta_I} \rho(g, f) \le \rho(f_I, f) \le A_2 I^{-r}$. Let $\mathcal{F}_0$ be the collection of these density
functions. We will first derive the posterior concentration rate for the elements
in $\mathcal{F}_0$ in terms of the parameter $r$. For different function classes, this
approximation rate $r$ can be calculated explicitly. Results of this type have been
discussed in a parallel paper ([12]). In addition to this, we also assume that
$f_0$ has finite second moment.

We want to point out that, based on the minimaxity of the Bayes
estimator, it is necessary to restrict our attention to a subset of $\mathcal{F}$. In [2] and [1],
the authors demonstrated that it is impossible to find an estimator which
works uniformly well for every $f$ in $\mathcal{F}$. This is the case because for any
estimator $\hat f$, there always exists $f \in \mathcal{F}$ for which $\hat f$ is inconsistent.

2.3. Prior specification. An ideal prior $\Pi$ on $\Theta = \cup_{I=1}^{\infty} \Theta_I$ is supposed to
be capable of balancing the approximation error and the complexity of $\Theta$.
The prior in this paper penalizes the size of the partition in the sense that
the probability mass on each $\Theta_I$ is proportional to $\exp(-\lambda I \log I)$. Given
a sample of size $n$, we restrict our attention to $\Theta_n = \cup_{I=1}^{n} \Theta_I$, because
in practice it is not meaningful to study a partition with the number of
subregions greater than the sample size. This is to say, when $I \le n$, $\Pi(\Theta_I) \propto
\exp(-\lambda I \log I)$, otherwise $\Pi(\Theta_I) = 0$.

If we use $T_I$ to denote the total number of possible partitions of size $I$,
then it is not hard to see that $\log T_I \le c^* I \log I$, where $c^*$ is a constant.
Within each $\Theta_I$, the prior is uniform across all binary partitions. In other
words, let $\{\Omega_i\}_{i=1}^{I}$ be a binary partition of $\Omega$ of size $I$, and let $\mathcal{F}(\{\Omega_i\}_{i=1}^{I})$ be
the collection of piecewise constant density functions on this partition (i.e.
$\mathcal{F}(\{\Omega_i\}_{i=1}^{I}) = \{f = \sum_{i=1}^{I} \frac{\theta_i}{\mu(\Omega_i)} 1_{\Omega_i} : \sum_{i=1}^{I} \theta_i = 1 \text{ and } \theta_i \ge 0,\ i = 1, \dots, I\}$),
then

$$\Pi\big(\mathcal{F}(\{\Omega_i\}_{i=1}^{I})\big) \propto \exp(-\lambda I \log I) / T_I. \tag{2.5}$$

Given a partition $\{\Omega_i\}_{i=1}^{I}$, the weights $\theta_i$ on the subregions follow a
truncated Dirichlet distribution with parameters all equal to $\alpha$ ($\alpha < 1$). This is
to say, for $x_1, \dots, x_I \ge \tau$ and $\sum_{i=1}^{I} x_i = 1$,

$$\Pi\Big(f = \sum_{i=1}^{I} \frac{\theta_i}{\mu(\Omega_i)} 1_{\Omega_i} : \theta_1 \in dx_1, \dots, \theta_I \in dx_I \,\Big|\, f \in \mathcal{F}(\{\Omega_i\}_{i=1}^{I})\Big) \propto \frac{\Gamma(\alpha I)}{(\Gamma(\alpha))^{I}} \prod_{i=1}^{I} x_i^{\alpha - 1}; \tag{2.6}$$

otherwise, the prior probability is zero. $\tau$ is the truncation parameter. In
this paper, we set $\tau$ to be $D I^{-\kappa}$ $(D, \kappa > 0)$.
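To make the two layers of the prior concrete, the sketch below computes the normalized prior mass on partition sizes and draws weights from a truncated symmetric Dirichlet distribution. It is an illustration under stated assumptions, not the sampler of [14]: the function names are hypothetical, the numerical values of the hyperparameters are arbitrary, and truncation is implemented by plain rejection sampling, which is only practical when the truncation level is small.

```python
import math
import random


def size_prior(n, lam):
    """Normalized prior mass on partition sizes I = 1, ..., n.

    The mass on size I is proportional to exp(-lam * I * log(I)).
    """
    log_w = [-lam * size * math.log(size) for size in range(1, n + 1)]
    shift = max(log_w)                      # subtract the max for numerical stability
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def truncated_dirichlet(size, alpha, tau, rng, max_tries=100000):
    """Draw weights from a symmetric Dirichlet(alpha) restricted to weights >= tau.

    Uses rejection sampling: draw from the untruncated Dirichlet via
    normalized Gamma variables and keep the first draw with all weights >= tau.
    """
    for _ in range(max_tries):
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
        total = sum(gammas)
        theta = [g / total for g in gammas]
        if min(theta) >= tau:
            return theta
    raise RuntimeError("rejection sampling failed; tau may be too large")


if __name__ == "__main__":
    rng = random.Random(0)
    size, alpha, D, kappa = 4, 0.5, 1.0, 3.0   # illustrative values only
    tau = D * size ** (-kappa)
    print(size_prior(5, lam=1.0))
    print(truncated_dirichlet(size, alpha, tau, rng))
```

Because the log-weights decrease in the partition size, the normalized masses returned by `size_prior` are decreasing, which is the sense in which the prior penalizes large partitions.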


2.4. Posterior concentration rate. We are interested in how fast the
posterior probability measure concentrates around the true density $f_0$.
Under the prior specified above, the posterior probability is the random measure
given by

$$\Pi(B \mid Y_1, \dots, Y_n) = \frac{\int_B \prod_{j=1}^{n} f(Y_j) \, d\Pi(f)}{\int_\Theta \prod_{j=1}^{n} f(Y_j) \, d\Pi(f)}.$$

A Bayesian estimator is said to be consistent if the posterior distribution
concentrates on arbitrarily small neighborhoods of $f_0$, with probability
tending to 1 under $P_0^n$ ($P_0^n$ is the probability measure corresponding to the
density function $f_0$). The posterior concentration rate refers to the rate at which
these neighborhoods shrink to zero while still possessing most of the
posterior mass. More explicitly, we want to find a sequence $\epsilon_n \to 0$, such that for
sufficiently large $M$,

$$\Pi(f : \rho(f, f_0) \ge M \epsilon_n \mid Y_1, \dots, Y_n) \to 0 \quad \text{in } P_0^n\text{-probability}.$$

The following theorem gives the posterior concentration rate under the
prior probability specified in Section 2.3.

Theorem 2.1. $Y_1, \dots, Y_n$ is a sequence of independent random variables
distributed according to $f_0$. $P_0^n$ is the probability measure corresponding to
$f_0$. $\Theta$ is the collection of all the p-dimensional density functions supported
by the binary partitions as defined in Section 2.1. The prior distribution
on $\Theta$ is as specified in Section 2.3. If $f_0 \in \mathcal{F}_0$ and $\kappa > \max(2, 4r)$, then
$\epsilon_n = n^{-\frac{r}{2r+1}} (\log n)^{2 + \frac{1}{2r}}$ is the posterior concentration rate.

The strategy to show this theorem is to write the posterior probability
measure as

$$\Pi(f : \rho(f, f_0) \ge M \epsilon_n \mid Y_1, \dots, Y_n) = \frac{\sum_{I=1}^{n} \int_{\{f : \rho(f, f_0) \ge M \epsilon_n\} \cap \Theta_I} \prod_{j=1}^{n} \frac{f(Y_j)}{f_0(Y_j)} \, d\Pi(f)}{\sum_{I=1}^{n} \int_{\Theta_I} \prod_{j=1}^{n} \frac{f(Y_j)}{f_0(Y_j)} \, d\Pi(f)}. \tag{2.7}$$

The proof still relies on the mechanism developed in the landmark works [6]
and [17]. We first derive the upper bounds for the items in the numerator by
employing previous results from the study of empirical processes in Section 3.
Then we lower bound the prior mass of the shrinking ball around the true
density in Section 4. In Section 5, these bounds are integrated together,
leading to a complete proof of the posterior concentration rate.


2.5. Discussion. 


2.5.1. Comparison to the sieve MLE. In the companion work [12], we
studied the convergence rate of the sieve maximum likelihood estimators. In
that paper, the approximating spaces $\Theta_I$ are defined in the same way, and
we consider the same subset of density functions $\mathcal{F}_0$.

For any $f \in \Theta_I$, the log-likelihood is defined to be

$$L_n(f) = \sum_{j=1}^{n} \log f(Y_j) = \sum_{i=1}^{I} N_i \log \beta_i,$$

where $N_i$ is the count of data points in $\Omega_i$, i.e., $N_i = \mathrm{card}\{j : Y_j \in \Omega_i, 1 \le
j \le n\}$. The maximum likelihood estimator on $\Theta_I$ is defined to be

$$\hat f_{n,I} = \arg\max_{f \in \Theta_I} L_n(f).$$
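On a single fixed partition the maximization has a closed form: maximizing the cell-count form of the log-likelihood subject to the density integrating to one gives heights equal to the cell count divided by the sample size times the cell measure, which is the usual histogram estimate. The sketch below illustrates this in one dimension; the search over all binary partitions of a given size, which the sieve MLE also requires, is not shown, and the function names are hypothetical.

```python
import math


def fixed_partition_mle(data, edges):
    """Cell counts and maximum likelihood heights on a fixed partition of [0, 1].

    `edges` lists the cell boundaries in increasing order, from 0.0 to 1.0.
    The height on a cell is count / (n * cell length).
    """
    n = len(data)
    n_cells = len(edges) - 1
    counts = [0] * n_cells
    for y in data:
        for i in range(n_cells):
            is_last = i == n_cells - 1
            if edges[i] <= y < edges[i + 1] or (is_last and y == edges[-1]):
                counts[i] += 1
                break
    heights = [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]
    return counts, heights


def log_likelihood(counts, heights):
    """Log-likelihood written as the sum over cells of count * log(height)."""
    return sum(c * math.log(b) for c, b in zip(counts, heights) if c > 0)


if __name__ == "__main__":
    data = [0.1, 0.2, 0.3, 0.6, 0.9, 0.05, 0.7, 0.4]
    edges = [0.0, 0.5, 0.75, 1.0]
    counts, heights = fixed_partition_mle(data, edges)
    print(counts, heights, log_likelihood(counts, heights))
```

Comparing the log-likelihood of the fitted heights with that of any other density on the same partition (for example the uniform density) shows the fitted heights are never worse.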

The next theorem presents the result on the convergence rate of the sieve MLE. It is
cited from [12].

Theorem 2.2. For any $f_0 \in \mathcal{F}_0$, $\hat f_{n,I}$ is the corresponding maximum
likelihood estimator over $\Theta_I$. $r$ is the parameter that characterizes the decay
rate of the approximation error to $f_0$ by the elements in $\Theta_I$. Assume that $n$
and $I$ satisfy

(2.8)

where the constant $c_1$ can be chosen to be in (0,1), and $A_2$ is a constant
associated with the decay rate of the approximation error. Then the convergence
rate of the sieve MLE is $n^{-\frac{r}{2r+1}} (\log n)^{\frac{1}{2} + \frac{r}{2r+1}}$.

Comparing these two rates, we can easily see that they are of the same
order up to a logarithmic term. However, for the sieve method, in order to
achieve the optimal convergence rate we need to match the size of the
partition $I$ to the sample size $n$, and this matching depends on an unknown
property of the true density function, namely the decay rate $r$ of the
approximation error. This implies that, in practice, it is infeasible
to achieve the optimal rate under the frequentist setting. On the other hand,
under Bayesian settings, by imposing a prior on $\Theta_I$, we are able to achieve
the optimal rate without any a priori information. This is one of the major
improvements of the Bayesian method.
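The dependence of the matched sieve size on the unknown $r$ can be made tangible with a small calculation. The sketch below assumes, purely for illustration, that the matched partition size is taken to be $n^{1/(2r+1)}$ with constants and logarithmic factors ignored; the function names are hypothetical.

```python
import math


def matched_sieve_size(n, r):
    """Partition size of order n^(1/(2r+1)); constants and log factors are ignored."""
    return math.ceil(n ** (1.0 / (2.0 * r + 1.0)))


def rate_exponent(r):
    """Exponent r/(2r+1) in the polynomial part n^(-r/(2r+1)) of the rate."""
    return r / (2.0 * r + 1.0)


if __name__ == "__main__":
    n = 10000
    for r in (0.25, 0.5, 1.0, 2.0):
        print(r, matched_sieve_size(n, r), rate_exponent(r))
```

For a fixed sample size the matched size changes substantially with $r$, so a sieve tuned for one value of $r$ is mismatched for another; the prior on the partition size removes the need to make this choice.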

2.5.2. Computational issues. The total number of binary partitions grows
exponentially in $I$, so the computational issues need to be addressed. In
[14], as we mentioned before, the authors imposed a very similar prior
distribution. By employing sequential importance sampling, they designed
an efficient algorithm to sample from the posterior distribution. Currently, the
dimension of the problem can be moderately large, say around 50.

2.5.3. Applications to different function classes. In the parallel paper,
we studied decay rates of the approximation error for different classes of
density functions, including densities satisfying a type of sparsity, the space
of bounded variation, and mixed-Hölder continuous functions. Since in this
paper we use the same approximating spaces, those results still hold. Given
this, we can also calculate the corresponding rates of posterior contraction.
Based on the minimaxity of the Bayesian estimator, these rates are at least
upper bounds of the minimax convergence rates. In fact, for the one
dimensional density functions of bounded variation, the posterior contraction rate
is n“^/^(logn)^/^. If we estimate the density by wavelet thresholding, the 
convergence rate is n“^/^(logn)^/^. As a benchmark, the minimax rate of 
convergence is 





L. LIU AND W. H. WONG 


2.5.4. Univariate case. In [15], the author investigated rates of
convergence for the posterior distribution under the mixture of Beta prior. The
true density function is assumed to be Hölder continuous on [0,1]. More
rigorously, the class of Hölder functions $\mathcal{H}(L, \beta)$ with regularity parameter $\beta$ is
defined as follows: let $k$ be the largest integer smaller than $\beta$, and
denote by $f^{(k)}$ its $k$th derivative.

$$\mathcal{H}(L, \beta) = \{f : [0,1] \to \mathbb{R} : |f^{(k)}(x) - f^{(k)}(y)| \le L |x - y|^{\beta - k}\}.$$

Then, under a class of location mixtures of Beta models, the concentration
rate of the posterior distribution is $n^{-\beta/(2\beta+1)}$ up to a $\log n$ term. It is
known that the rate $n^{-\beta/(2\beta+1)}$ is the minimax rate of convergence
for the class $\mathcal{H}(L, \beta)$.

Under the prior distribution specified in this paper, we can also study the
posterior contraction rate for the Hölder class. However, given the piecewise
constant approximations, we will only study Hölder continuous functions
on [0,1] with regularity parameter $\beta$ in (0,1]. For this class of density
functions, we already calculated the decay rate of the approximation error in [12].
Then the convergence rate of the posterior distribution is $n^{-\frac{\beta}{2\beta+1}} (\log n)^{2 + \frac{1}{2\beta}}$.
Up to a $\log n$ term, this method still achieves the minimax rate of
convergence.

3. Upper bound of the numerator. Briefly speaking, the numerator
can be bounded by controlling the complexity of the parameter space $\Theta$.
Here, the complexity of the model is measured by the metric entropy. A
general discussion of metric entropy can be found in [11]. In this section,
we introduce a form of metric entropy with bracketing corresponding to
the relevant parameter space, and provide an upper bound for the metric
entropy of the approximating spaces defined in Section 2.1. These bounds
lead to upper bounds for the items in the numerator of (2.7).

Definition 3.1. Let $(\Theta, \rho)$ be a separable pseudo-metric space. $\Theta(\epsilon)$ is
a finite set of pairs of functions $\{(f_j^L, f_j^U), j = 1, \dots, N\}$ satisfying

$$\rho(f_j^L, f_j^U) \le \epsilon \quad \text{for } j = 1, \dots, N, \tag{3.1}$$

and for any $f \in \Theta$, there is a $j$ such that

$$f_j^L \le f \le f_j^U. \tag{3.2}$$

Let

$$N(\epsilon, \Theta, \rho) = \min\{\mathrm{card}\, \Theta(\epsilon) : \text{(3.1) and (3.2) are satisfied}\}. \tag{3.3}$$

Then, we define the metric entropy with bracketing of $\Theta$ to be

$$H(\epsilon, \Theta, \rho) = \log N(\epsilon, \Theta, \rho). \tag{3.4}$$

Recall that $\Theta_1, \dots, \Theta_I, \dots$ are the approximating spaces defined in
Section 2.1. The next lemma is devoted to an upper bound for the bracketing
metric entropy of $\Theta_I$.

Lemma 3.1. Take $\rho$ to be the Hellinger distance. Let $\Theta_I^d = \{f \in \Theta_I :
\rho(f, f_0) \le d\}$. Then,

$$H(\epsilon, \Theta_I^d, \rho) \le I \log p + (I + 1) \log(I + 1) + \frac{I}{2} \log I + I \log \frac{d}{\epsilon} + c', \tag{3.5}$$

where $c'$ is a constant not dependent on $I$ or $d$.

Proof. See the proofs of Lemma 3.1 and Lemma 3.2 in [12]. □

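Our next theorem, which is Theorem 1 in [22], gives a uniform exponential
bound for likelihood ratios.

Theorem 3.1 (Wong and Shen (1995)). There exist positive constants
$a > 0$, $c$, $c_1$ and $c_2$, such that, for any $\epsilon > 0$, if

$$\int_{\epsilon^2 / 2^8}^{\sqrt{2} \epsilon} H^{1/2}(u/a, \Theta, \rho) \, du \le c\, n^{1/2} \epsilon^2, \tag{3.6}$$

then

$$P^*_{f_0}\Big( \sup_{\{\rho(f, f_0) \ge \epsilon,\ f \in \Theta\}} \prod_{j=1}^{n} \frac{f(Y_j)}{f_0(Y_j)} \ge \exp(-c_1 n \epsilon^2) \Big) \le 4 \exp(-c_2 n \epsilon^2),$$

where $P^*_{f_0}$ is understood to be the outer probability measure under $f_0$. The
constants $c_1$ and $c_2$ can be chosen in (0,1) and $c$ can be set as $(2/3)^{5/2}/512$.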


Finally, the next lemma provides an upper bound for the items in the 
numerator in (2.7) when I is sufficiently large. 

Lemma 3.2. Let 6n,i = ^ ^ sufficiently large, 

we have 

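$$P^*_{f_0}\Big( \sup_{\{\rho(f, f_0) \ge \delta_{n,I},\ f \in \Theta_I\}} \prod_{j=1}^{n} \frac{f(Y_j)}{f_0(Y_j)} \ge \exp(-c_1 n \delta_{n,I}^2) \Big) \le 4 \exp(-c_2 n \delta_{n,I}^2).$$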
Proof. See the proof of Corollary 3.1 in [12]. □


Remark 3.1. Since the metric entropy decreases as $\epsilon$ increases, this
lemma also holds for any $\epsilon \ge \delta_{n,I}$. This property is quite useful in the proof
of the main theorem.

4. Lower bound of the denominator. In this section, we study how
the prior distribution concentrates on the shrinking neighborhoods around
the true density function. This is the key to bounding the denominator of
(2.7) from below. We develop our results through a series of lemmas. The
connection between the lower bounds of the items in the denominator of (2.7)
and the concentration rate of the prior distribution is derived first (Lemma 4.1). By
employing a property of the Dirichlet distribution (Lemma 4.3) and inequalities
bounding the Kullback-Leibler divergence by the Hellinger distance (Lemma 4.2),
we obtain lower bounds of the terms in the denominator of (2.7) in Lemma 4.4.


To begin with, we cite a result from [17]. In this lemma, it is shown that
with probability close to 1, the denominator is bounded from below by the
prior probability mass concentrating on a ball around $f_0$ multiplied by a
coefficient depending on the radius of the ball.

Lemma 4.1 (Shen and Wasserman (2001) Lemma 1). Let $K(f_0, f)$ and
$V(f_0, f)$ be as defined in (2.3) and (2.4), and let $S(t) = \{f \in \Theta : K(f_0, f) \le
t,\ V(f_0, f) \le t\}$. Set $S_n = S(t_n)$. When $t_n$ is a sequence of positive numbers
satisfying $n t_n \to \infty$,

$$P_0^n\Big( \int_\Theta \prod_{j=1}^{n} \frac{f(Y_j)}{f_0(Y_j)} \, d\Pi(f) \le \frac{1}{2} \Pi(S_n) e^{-2 n t_n} \Big) \le \frac{2}{n t_n}.$$

More explicitly, from this lemma we learn that, given the condition $n t_n \to \infty$,
$\int_\Theta \prod_{j=1}^{n} \frac{f(Y_j)}{f_0(Y_j)} \, d\Pi(f) \ge \frac{1}{2} \Pi(S_n) e^{-2 n t_n}$ with probability close to 1.

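It is well known that the Hellinger distance can be bounded by the
Kullback-Leibler divergence. In [22], the authors showed that the other direction also holds
under an integrability condition. Their results are summarized in the lemma
below.

Lemma 4.2 (Wong and Shen (1995) Theorem 5). Let $f$, $f_0$ be two densities,
$\rho^2(f, f_0) \le \epsilon^2$. Suppose that $M_\delta^2 = \int_{f_0/f \ge e^{1/\delta}} f_0 (f_0/f)^{\delta} < \infty$ for some
$\delta \in (0,1]$. Then for all $\epsilon^2 \le \frac{1}{2}(1 - e^{-1})^2$, we have

$$\int f_0 \log\Big(\frac{f_0}{f}\Big) \le \Big[ 6 + \frac{2 \log 2}{(1 - e^{-1})^2} + \frac{8}{\delta} \max\Big(1, \log\Big(\frac{M_\delta}{\epsilon}\Big)\Big) \Big] \epsilon^2,$$

$$\int f_0 \Big(\log\Big(\frac{f_0}{f}\Big)\Big)^2 \le 5 \epsilon^2 \Big[ \frac{1}{\delta} \max\Big(1, \log\Big(\frac{M_\delta}{\epsilon}\Big)\Big) \Big]^2.$$

From the preceding lemma, we learn that, if $\rho^2(f, f_0) \le \epsilon^2$, then

$$\max\Big( K(f_0, f),\ E_{f_0}\Big(\Big(\log \frac{f_0(Y)}{f(Y)}\Big)^2\Big) \Big) = O\Big(\epsilon^2 \Big(\log \frac{M_\delta}{\epsilon}\Big)^2\Big).$$

This further implies that there exists a constant $L$ such that

$$\Big\{f : \rho(f_0, f) \le \frac{L \epsilon}{\log(M_\delta/\epsilon)}\Big\} \subset \Big\{f : K(f_0, f) \le \epsilon^2,\ E_{f_0}\Big(\Big(\log \frac{f_0(Y)}{f(Y)}\Big)^2\Big) \le \epsilon^2\Big\}. \tag{4.1}$$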

This lemma allows us to work on a Hellinger ball instead of a
Kullback-Leibler one. The transition is necessary because it is more straightforward
to apply a property of the Dirichlet distribution to estimate the probability
mass on a Hellinger ball around the true density function. In the lemma
below, this particular property of the Dirichlet distribution is stated in terms
of the $L_1$ distance, which is equivalent to the Hellinger distance. We want to
point out that this lemma is a variation of Lemma 6.1 in [6] and the proof
is adapted from their paper.

Lemma 4.3. Let $(X_1, \dots, X_I)$ be distributed according to the truncated
Dirichlet distribution (2.6) with truncation parameter $\tau$. Let $(x_{10}, \dots, x_{I0})$
be any point on the $I$-simplex. Let $\epsilon \le 1/I$. Assume that $\tau < \epsilon^2$. Then

$$P\Big( \sum_{i=1}^{I} |X_i - x_{i0}| \le 2\epsilon \Big) \ge \frac{\Gamma(\alpha I)}{(\Gamma(\alpha))^{I}} (\epsilon^2 - \tau)^{I}. \tag{4.2}$$

Proof. We can find an index $i$ such that $x_{i0} \ge 1/I$. By relabeling, we
can assume that $i = I$. If $|x_i - x_{i0}| \le \epsilon^2$ for $i = 1, \dots, I - 1$, then

$$\sum_{i=1}^{I-1} x_i \le 1 - x_{I0} + (I - 1)\epsilon^2 \le (I - 1)(\epsilon^2 + 1/I) \le 1 - \epsilon^2 < 1.$$

Therefore, there exists $x = (x_1, \dots, x_I)$ in the simplex with these first
$I - 1$ coordinates. And

$$\sum_{i=1}^{I} |x_i - x_{i0}| \le 2 \sum_{i=1}^{I-1} |x_i - x_{i0}| \le 2\epsilon^2 (I - 1) \le 2\epsilon.$$

Therefore, the probability on the left hand side of (4.2) is bounded below
by

$$P(|X_i - x_{i0}| \le \epsilon^2,\ i = 1, \dots, I - 1) \ge \frac{\Gamma(\alpha I)}{(\Gamma(\alpha))^{I}} \prod_{i=1}^{I-1} \int_{\max(x_{i0} - \epsilon^2,\, \tau)}^{\min(x_{i0} + \epsilon^2,\, 1)} x_i^{\alpha - 1} \, dx_i.$$

Since $\alpha < 1$, we can lower bound the integrand by 1, and the interval of
integration contains at least an interval of length $\epsilon^2 - \tau$. Therefore, the
result above can be further lower bounded by

$$\frac{\Gamma(\alpha I)}{(\Gamma(\alpha))^{I}} (\epsilon^2 - \tau)^{I-1} \ge \frac{\Gamma(\alpha I)}{(\Gamma(\alpha))^{I}} (\epsilon^2 - \tau)^{I}.$$

This finishes the proof. □
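The right hand side of (4.2) becomes extremely small as the partition size grows, so in numerical work it is more convenient to handle its logarithm. The following sketch evaluates the logarithm of the bound as it is read here, namely the ratio of Gamma functions times $(\epsilon^2 - \tau)^I$; the function name and the explicit checks on the lemma's conditions are our own additions for illustration.

```python
import math


def log_dirichlet_ball_bound(size, alpha, eps, tau):
    """Logarithm of Gamma(alpha*I) / Gamma(alpha)^I * (eps^2 - tau)^I.

    Checks the conditions under which the bound is stated:
    0 < alpha < 1, eps <= 1/I and 0 < tau < eps^2.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if eps > 1.0 / size:
        raise ValueError("eps must be at most 1/I")
    if not 0.0 < tau < eps ** 2:
        raise ValueError("tau must lie in (0, eps^2)")
    return (math.lgamma(alpha * size)
            - size * math.lgamma(alpha)
            + size * math.log(eps ** 2 - tau))


if __name__ == "__main__":
    # Illustrative values only.
    print(log_dirichlet_ball_bound(3, 0.5, 0.3, 0.01))
    print(log_dirichlet_ball_bound(50, 0.5, 0.02, 1e-5))
```

Using `math.lgamma` avoids overflow and underflow in the Gamma functions, which matters because the bound is later combined with terms of order $I \log n$ in an exponent.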


Now, we are ready to derive lower bounds for the prior probability mass
on the $\Theta_I$'s when $I$ varies within a certain range. Before stating the result, we
want to briefly review the assumptions we made in Section 2.2 and Section
2.3. First, in terms of approximation error, we assume that for any $f_0 \in \mathcal{F}_0$,
there exists a sequence of $f_I \in \Theta_I$, such that $A_1 I^{-r} \le \min_{g \in \Theta_I} \rho(g, f_0) \le
\rho(f_I, f_0) \le A_2 I^{-r}$ for some positive constants $A_1$ and $A_2$. Second, we imposed
a moment condition on $\mathcal{F}_0$. For any $f \in \mathcal{F}_0$, we assume that $\int f^2 < \infty$.
Finally, given a partition of size $I$, the weights on the subregions within
the partition follow a Dirichlet distribution truncated from below, with the
truncation parameter $\tau = D I^{-\kappa}$ $(D, \kappa > 0)$. Under these three assumptions,
we will derive the lower bound in the lemma below.


Lemma 4.4. Assume that Jq € Pq. H is the prior probability specified in 
Section 2.3, with k > max(2,4r). Let tn,i = i — n/^iogn ■ ^ , 

we have 


< 


fo 


e. 




< -n(0/)exp(-2nt„,/ - 


TXtn,I 


c*I\ogI — 40;/log n — /logF(a 
where $\omega = \max(1, 1/2r)$.


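Proof. Let $S_{n,I} = \{f \in \Theta_I : K(f_0, f) \le t_{n,I},\ V(f_0, f) \le t_{n,I}\}$. From
Lemma 4.1, we have the bound

$$P_0^n\Big( \int_{\Theta_I} \prod_{j=1}^{n} \frac{f(Y_j)}{f_0(Y_j)} \, d\Pi(f) \le \frac{1}{2} \Pi(S_{n,I}) e^{-2 n t_{n,I}} \Big) \le \frac{2}{n t_{n,I}}. \tag{4.3}$$

As the next step, we look for a lower bound for $\Pi(S_{n,I})$. The way to approach
this is to find a subset of $S_{n,I}$ to which we can apply Lemma 4.3. Our
argument is as follows.

Define $\tilde S_{n,I} = \{f \in \Theta_I : K(f_0, f) \le t_{n,I},\ E_{f_0}((\log \frac{f_0(Y)}{f(Y)})^2) \le t_{n,I}\}$. Note
that $E_{f_0}((\log \frac{f_0(Y)}{f(Y)})^2) \ge V(f_0, f)$, so we have $\tilde S_{n,I} \subset S_{n,I}$. From (4.1), applied
with $\epsilon_{n,I}^2 = t_{n,I}$, we know that

$$W_{n,I} := \Big\{f \in \Theta_I : \rho(f_0, f) \le \frac{L \epsilon_{n,I}}{\log(M_\delta / \epsilon_{n,I})}\Big\} \subset \tilde S_{n,I}.$$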


With the truncation parameter r = DI Ms = 0(1^^ f /o^"*"'^^). Further¬ 


more, 




log 


Ms 

^n.I 


= O 


/log I 

n/ log n 


1/2 




(4.4) 




Under the assumptions that I = , there exists fj G 0/, such that 

p{fo, fi) < .^^"ms ■ If we define 


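$$\tilde W_{n,I} := \Big\{f \in \Theta_I : \rho(f, f_I) \le \frac{L \epsilon_{n,I}}{\log(M_\delta / \epsilon_{n,I})} - \rho(f_0, f_I)\Big\},$$

by the triangle inequality, we know that $\tilde W_{n,I} \subset W_{n,I}$. Together with the
previous result, we claim that there exists a constant $L'$, such that

$$B_{n,I} := \Big\{f \in \Theta_I : \rho(f, f_I) \le L' \Big(\frac{I \log I}{n \log n}\Big)^{1/2}\Big\} \subset \tilde W_{n,I}.$$

Next, from the fact $\rho^2(f, g) \le \|f - g\|_{L_1}$, we have

$$\tilde B_{n,I} := \Big\{f \in \Theta_I : \|f - f_I\|_{L_1} \le L'^2 \frac{I \log I}{n \log n}\Big\} \subset B_{n,I}.$$

Note that $\Pi(\tilde B_{n,I}) = \Pi(\Theta_I) \Pi(\tilde B_{n,I} \mid \Theta_I)$. Assume that $f_I$ is supported by
the binary partition $\{\Omega_{i0}\}_{i=1}^{I}$. Let $\mathcal{F}_0^I = \{f \in \Theta_I : f = \sum_{i=1}^{I} \beta_i 1_{\Omega_{i0}};\ \beta_i \ge
0,\ \sum_{i=1}^{I} \beta_i \mu(\Omega_{i0}) = 1\}$ be the collection of all the density functions in $\Theta_I$ which are
supported by the same binary partition as $f_I$. Then

$$\Pi(\tilde B_{n,I} \mid \Theta_I) \ge \Pi(\tilde B_{n,I} \mid \mathcal{F}_0^I) \Pi(\mathcal{F}_0^I \mid \Theta_I) \ge \exp(-c^* I \log I) \Pi(\tilde B_{n,I} \mid \mathcal{F}_0^I). \tag{4.5}$$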


Now we apply Lemma 4.3 to bound Il{Bnj\FQ) from below. We will works 

with an Li-ball with radius { ^niogn^ )‘^i where io is chosen to be max(l, l/2r). 

We can always assume that L' < 1, otherwise we can work with a smaller 

1 

ball instead. Obviously, this ball is contained in Bnj. When I = , 

we have { ^niogn ^ ^ 7' Under the assumptions k > max(2,4r), we know 
that when I > ^ DI~'^ = By setting x^o in the lemma to 

probability mass on rijo under //, we have 


(4.6) 


n(i3„^/|Fo) 


> 

> 


r(a/) L^^JlogF ^ 

(r(a))'^ 2nlogn 

exp(—/logr(a) — 4a;Ilogn). 


Combining (4.3), (4.5) and (4.6), we get the desired result. □


5. Proof of Theorem 2.1. In this section, we will combine the upper 
bound in Section 3 and the lower bound in Section 4 together to derive the 
posterior concentration rate. 


Proof of Theorem 2.1. Let €n = n (logn)^+ 2 r andrjnj = 
First, we divide the items in (2.7) into three blocks. We define 


Ni-l 


iNum — ^ ^ 


n 


/(U) 


/=i J {/;p(/Jo)>Me„}ne/ MYj) 

N2 f. 

IlNum — ^ ^ 


dn(/), 


I=Ni ■ 
n 

IIlNum — ^ ^ 

/=N2 + 1 


'{Lp(/Jo)>M£„}ne7 fo{Yj) 


n M<'n(/). 


'{/:p(/Jo)>Afe„}ne/ MY,) 


n M‘'n(/), 


1 _ 1 1 
where Ni = n 2 ’'+i (logn) >• and N 2 = (logn)^. 

We deal with each block in the numerator separately. Roughly speaking,
when $I$ is small, the approximation error to $f_0$ dominates, and these items
can be bounded by the Hellinger distance between $f$ and $f_0$. The items in the
middle range can be bounded by controlling the metric entropy of $\Theta_I$. The
items in the last block are negligible because the prior probability decays to
zero fast.

We assume that there exists a sequence of $f_I \in \Theta_I$ such that $A_1 I^{-r} \le \min_{g \in \Theta_I} \rho(g, f_0) \le \rho(f_I, f_0) \le A_2 I^{-r}$ for some positive constants $A_1$ and $A_2$.
When $I < N_1$, $A_1 I^{-r}$ is greater than $\epsilon_n$. We can apply Lemma 3.2 by setting
$\delta_{n,I}$ to be $A_1 I^{-r}$. Therefore, as $n \to \infty$,

$$\mathrm{I}_{\mathrm{Num}} \le \sum_{I=1}^{N_1-1} \Pi(\Theta_I) \exp(-A_1 n I^{-2r}) \le \Big( \sum_{I=1}^{N_1-1} \exp(-2A_1 n I^{-2r}) \Big)^{1/2}.$$

Now, we will estimate the order of the summation in the last line. In order
to simplify the notation, we will discuss the order of $\sum_{I=1}^{N_1-1} \exp(-2A_1 n / I^{2r})$ in
detail.

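We know that the mass is concentrated around $I = N_1 - 1$, so we use a power series
expansion around that point. For the terms away from it, we have

$$\sum_{I=1}^{(1-\epsilon)N_1} \exp\Big(-\frac{2A_1 n}{I^{2r}}\Big) \le (1-\epsilon) N_1 \exp\Big(-\frac{2A_1 n}{((1-\epsilon)N_1)^{2r}}\Big),$$

which is a lower order term compared to the last term in the summation
and thus does not contribute significantly to the summation. Let $1 - \delta = I/N_1$, and
expand

$$(1-\delta)^{-2r} = 1 + 2r\delta + r(2r+1)\delta^2 + o(\delta^2).$$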


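$$\begin{aligned}
\sum_{I=(1-\epsilon)N_1}^{N_1-1} \exp\Big(-\frac{2A_1 n}{I^{2r}}\Big)
&\le \int_{(1-\epsilon)N_1}^{N_1} \exp\Big(-\frac{2A_1 n}{x^{2r}}\Big)\, dx \\
&= \int_0^{\epsilon} \exp\Big(-2A_1 n^{\frac{1}{2r+1}} (\log n)^2 (1-\delta)^{-2r}\Big) N_1\, d\delta \\
&= \int_0^{\epsilon} \exp\Big(-2A_1 n^{\frac{1}{2r+1}} (\log n)^2 (1 + 2r\delta + o(\delta))\Big) N_1\, d\delta \\
&\lesssim \frac{\exp\big(-2A_1 n^{\frac{1}{2r+1}} (\log n)^2\big)}{4rA_1 (\log n)^{\frac{1}{r}+2}}.
\end{aligned}$$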
Therefore

(5.1) $\mathrm{I}_{\mathrm{Num}} \lesssim (\log n)^{-1-\frac{1}{2r}} \exp\big(-A_1 n^{\frac{1}{2r+1}} (\log n)^2\big).$
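The sum-versus-integral comparison behind (5.1) can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not the paper's argument: it takes an integer cutoff `n1` in place of $N_1$, writes $c = 2A_1 n / N_1^{2r}$, and compares $\sum_{I=1}^{N_1-1} \exp(-2A_1 n / I^{2r})$ with the bound $N_1 e^{-c} / (2rc)$ that the first-order expansion $(1-\delta)^{-2r} \ge 1 + 2r\delta$ yields. All function and parameter names are hypothetical. Everything is done on the log scale, because the individual terms underflow for realistic $n$.

```python
import math


def log_block_one_sum(n, r, a1, n1):
    """log of sum_{I=1}^{n1-1} exp(-2*a1*n / I**(2r)), via log-sum-exp.

    n1 is an integer cutoff standing in for N_1 (requires n1 >= 2).
    """
    exponents = [-2.0 * a1 * n / I ** (2 * r) for I in range(1, n1)]
    top = max(exponents)
    return top + math.log(sum(math.exp(e - top) for e in exponents))


def log_block_one_bound(n, r, a1, n1):
    """log of n1 * exp(-c) / (2*r*c), where c = 2*a1*n / n1**(2r)."""
    c = 2.0 * a1 * n / n1 ** (2 * r)
    return -c + math.log(n1 / (2.0 * r * c))


if __name__ == "__main__":
    args = dict(n=10_000, r=1.0, a1=1.0, n1=50)
    print(log_block_one_sum(**args), log_block_one_bound(**args))
```

With $N_1 = n^{\frac{1}{2r+1}} (\log n)^{-\frac{1}{r}}$ the quantity $c$ equals $2A_1 n^{\frac{1}{2r+1}} (\log n)^2$, and the bound becomes $e^{-c} / (4rA_1 (\log n)^{\frac{1}{r}+2})$.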


From Lemma 3.2, we know that if the result applies for $\delta_{n,I}$, then it also
applies to $M\eta_{n,I} > \delta_{n,I}$. We have that when $N_1 \le I \le N_2$,


IlNum ^ 


N 2 

E 

/=Ni ' 

N 2 


< ^ exp(—Allogl) exp(—M^/(log/)^’''? log 


n 


7=Afi 
N 2 


1/2 


< 


exp(-2A/log/) 

yI=Ni 

exp ( —M^n 2 ^+i (logn)^) , 


N 2 

exp (— 2 M‘^I{logI)^^r logn 

J=Ni 


where the last line is obtained by integration by parts.
For $\mathrm{III}_{\mathrm{Num}}$, we have


IIIn V 


S E / nM-in(/) 


(5.2) 


I=N2+1’'^i j 
exp ( —n 


) n ^ n 

E / n/(^^)^n(/). 

/=Ar2+l j=l 


If we use $x_I$ to represent a partition of size $I$, and $X_I$ to denote the collection
of all binary partitions of size $I$, then the integral in (5.2) can be divided
into integrals over each partition as follows:


IIIn U 


< 




exp(-n//olog(/o)) E E / ft 
xn( 0 i,..., 0 /|x/)n(x/)d 0 i...d 07 
*’'*“*’'(475 -^)''*’‘p(-"/AM/o)) 

n 1 

E exp(—AL) V D(^cx T ui,..., o T u/) ■■ — r 1 

Tj ^ D(a, ...,a) 

/=Ar2+l ^ ^ i=l ' 


< 

^ r(a/) 


1/2 
where ^r(aj) ( 47 ? ~ is an upper bound for the normalizing constant of
the truncated Dirichlet distribution. This inequality can be obtained from 
Lemma 4.3, because, 


n « n 

=N2+ixieXi j=i 


< 


I=N2+1 xjGXj 

(r(a)y, 1 


E E 

/=A2+ix/eV7 


r(al) M/2 

~ n 

/ n x/)Dir( 6 »i,..., a,..., alx/)n(x/)d9i... dOi. 


Now, we focus on the part inside the summation, and apply Stirling's
approximation to the gamma function.


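$$\begin{aligned}
\frac{D(\alpha + n_1, \ldots, \alpha + n_I)}{D(\alpha, \ldots, \alpha)} \prod_{i=1}^{I} \frac{1}{|\Omega_i|^{n_i}}
&= \exp\Big( \log\Gamma(\alpha I) - I\log\Gamma(\alpha) + \sum_{i=1}^{I} \log\Gamma(\alpha + n_i) - \log\Gamma(\alpha I + n) + \sum_{i=1}^{I} n_i \log\frac{1}{|\Omega_i|} \Big) \\
&\le \exp\Big( \alpha I\log(\alpha I) - \alpha I - I\log\Gamma(\alpha) - (\alpha I + n)\log(\alpha I + n) + \alpha I + n \\
&\qquad\quad + \sum_{i=1}^{I} \Big[ (\alpha + n_i)\log(\alpha + n_i) - (\alpha + n_i) + n_i\log\frac{1}{|\Omega_i|} \Big] \Big). \qquad (5.3)
\end{aligned}$$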
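The first equality above is the standard closed form for a ratio of Dirichlet normalizing constants, and it can be evaluated exactly with log-gamma functions before any Stirling approximation is applied. The sketch below assumes $D(a_1, \ldots, a_I) = \prod_i \Gamma(a_i) / \Gamma(\sum_i a_i)$ and a symmetric Dirichlet with parameter `alpha`; the function names, and the `volumes` argument standing in for the cell volumes $|\Omega_i|$, are hypothetical.

```python
import math


def log_dirichlet_ratio(alpha, counts):
    """log of D(alpha + n_1, ..., alpha + n_I) / D(alpha, ..., alpha),
    assuming D(a_1, ..., a_I) = prod_i Gamma(a_i) / Gamma(sum_i a_i)."""
    size = len(counts)
    n = sum(counts)
    return (math.lgamma(alpha * size) - size * math.lgamma(alpha)
            + sum(math.lgamma(alpha + c) for c in counts)
            - math.lgamma(alpha * size + n))


def log_partition_integral(alpha, counts, volumes):
    """Adds the factor prod_i (1 / volume_i) ** n_i on the log scale."""
    return log_dirichlet_ratio(alpha, counts) - sum(
        c * math.log(v) for c, v in zip(counts, volumes))


if __name__ == "__main__":
    # Two cells, uniform prior (alpha = 1), counts (2, 3):
    # the ratio is the Beta integral 2! * 3! / 6! = 1 / 60.
    print(math.exp(log_dirichlet_ratio(1.0, [2, 3])))
```

Working on the log scale avoids overflow of the gamma function when the counts are large, which is the regime the Stirling bound addresses.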



Let $C(\alpha) = \log(1/\Gamma(\alpha)) - \alpha$; then


(5.3) 

( al ^ n- 

(5.4) < exp fa/ log ^ - n log(a/ + n) + C{a)I + 

Given a partition $x_I$, define $\mu_i = \int_{\Omega_i} f_0$, $\hat\mu_i = n_i/n$, and $\nu_i = \mu_i/|\Omega_i|$.

Then we have $\sqrt{n}(\hat\mu_i - \mu_i) \to N(0, \mu_i(1 - \mu_i))$ in distribution. With this
result,


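$$\begin{aligned}
(5.4) &= \exp\Big( \alpha I \log\frac{\alpha I}{\alpha I + n} - n\log\frac{\alpha I + n}{n} + C(\alpha) I + \sum_{i=1}^{I} n_i \log\hat\mu_i + \sum_{i=1}^{I} n_i \log\frac{1}{|\Omega_i|} \Big) \\
&= \exp\Big( \alpha I \log\frac{\alpha I}{\alpha I + n} - n\log\frac{\alpha I + n}{n} + C(\alpha) I + n\int f_0 \log(f_0) - nK(f_0, f_{x_I}) \\
&\qquad\quad + \sum_{i=1}^{I} n_i \log\frac{\hat\mu_i}{\mu_i} + n\sum_{i=1}^{I} \log(\nu_i)(\hat\mu_i - \mu_i) \Big) \\
&\approx \exp\Big( \alpha I \log\frac{\alpha I}{\alpha I + n} - n\log\frac{\alpha I + n}{n} + C(\alpha) I + n\int f_0 \log(f_0) - nK(f_0, f_{x_I}) \Big).
\end{aligned}$$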


From this result, we know that whether $I \ll n$ or $I$ is comparable to $n$,
the integral over each partition is bounded, provided that $\lambda$ is large enough.
Plugging this result into the summation, we have


IIIn um 


n 

< exp(-/logI) 

I=N2 

1 2 

< exp(—ZIn2’’+i (logn) ). 
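The rewriting of the log-likelihood term on a fixed partition, used above to isolate $n\int f_0 \log(f_0) - nK(f_0, f_{x_I})$, rests on an exact algebraic identity that can be checked numerically. The sketch below assumes the notation $\mu_i$ (true cell mass), $\hat\mu_i = n_i/n$ and $\nu_i = \mu_i/|\Omega_i|$, and uses the fact that $n\sum_i \mu_i \log \nu_i = n\int f_0 \log f_{x_I}$, which equals $n\int f_0 \log(f_0) - nK(f_0, f_{x_I})$. The function name and the example numbers are hypothetical.

```python
import math


def split_log_likelihood(counts, mu, volumes):
    """Split sum_i n_i * log(mu_hat_i / |Omega_i|) into three pieces.

    Returns (total, projection, sampling, linear) where
      projection = n * sum_i mu_i * log(nu_i)
      sampling   = sum_i n_i * log(mu_hat_i / mu_i)
      linear     = n * sum_i log(nu_i) * (mu_hat_i - mu_i)
    and total == projection + sampling + linear up to rounding.
    All mu_i must be positive; empty cells contribute zero.
    """
    n = sum(counts)
    mu_hat = [c / n for c in counts]
    nu = [m / v for m, v in zip(mu, volumes)]
    total = sum(c * math.log(mh / v)
                for c, mh, v in zip(counts, mu_hat, volumes) if c > 0)
    projection = n * sum(m * math.log(d) for m, d in zip(mu, nu))
    sampling = sum(c * math.log(mh / m)
                   for c, mh, m in zip(counts, mu_hat, mu) if c > 0)
    linear = n * sum(math.log(d) * (mh - m) for d, mh, m in zip(nu, mu_hat, mu))
    return total, projection, sampling, linear


if __name__ == "__main__":
    print(split_log_likelihood([12, 30, 58], [0.1, 0.3, 0.6], [0.25, 0.25, 0.5]))
```

The `sampling` piece is $n$ times a Kullback-Leibler divergence between the empirical and true cell masses, so it is nonnegative; the `linear` piece has mean zero, which is where the central limit statement for $\hat\mu_i$ enters.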


Therefore 

(2.7) 

^ (logn)“^“^ exp(—^in2’'+i (logn)^) + exp(—M^n2’’+i (logn)^ + exp(—iAn^^+i (logn)^)) 

~ Er.i Is, n”., i^dn(f) 

^ (logn)“^“^ exp(—^in2?+T(logn)^) + exp(—(logn)^) + exp(—ZAn^^Ai (logn)^) 

2 exp 2 ^^n 2 ’’+i (logn)2 — ( 2 ^^ + 4a;)n2^AT logn — n^’^+i (logr(a) + 1)^ 

where the last inequality is obtained by applying Lemma 4.4 to the space
$\Theta_I$ with $I = n^{\frac{1}{2r+1}}$. The last line goes to zero when $A_1$, $M$ and $D$ are all

greater than 2 ,^- 

Therefore, we have

$$\Pi(f : \rho(f, f_0) \ge M\epsilon_n \mid Y_1, \cdots, Y_n) \le \exp\Big(-b\, n^{\frac{1}{2r+1}} (\log n)^2\Big),$$


with probability tending to 1, where $b$ is a positive constant. This concludes
the proof. □ 